-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Run single phase aggregation when possible #131485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d342799 to
48274a9
Compare
48274a9 to
dcbe09f
Compare
264d405 to
21e3afc
Compare
21e3afc to
105fe9d
Compare
|
Hi @dnhatn, I've created a changelog YAML for you. |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
| rate_bytes_in:double | time_bucket:datetime | ||
| null | 2024-05-10T00:01:00.000Z | ||
| 28.80297619047619 | 2024-05-10T00:15:00.000Z | ||
| 28.802976190476194 | 2024-05-10T00:15:00.000Z |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side effect of running a single-phase aggregation for avg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, cool stuff!
A little bit of a drive-by, but I noticed that there is opportunity for simplification, please have a look at my comment.
That said, I see that you requested review from @fang-xing-esql and @ivancea - folks, it'd be great if (one of) you could remain main reviewer, as I won't be able to deeply review this atm. Thank you :)
| * For example, in FROM .. | STATS first | STATS second, the STATS second aggregation | ||
| * can be executed in a single phase on the coordinator instead of two phases. | ||
| */ | ||
| public class SinglePhaseAggregate extends PhysicalOptimizerRules.OptimizerRule<AggregateExec> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work, but it's actively working against code in the mapper that already distinguishes between "happens on the coordinator, only" vs. "happens partially on data nodes", see here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me! The new rule looks sensible in any case. But as I understand it, what Alex comments would be the ideal way: Not splitting in phases to begin with, and splitting aggs only when needed.
Maybe this is ok too, as the rule can be easily removed later. But I would investigate the other way. I believe it will be quite more complex tho...
Edit: Btw, this was 3 months old? Did something change since then that enables this PR again, or was it just "forgotten" there?
|
@alex-spies @ivancea I started with Mapper, but that class is quite fragile, so I added a separate rule instead. I will try to get this in for time-series soon, and then follow up by replacing the rule with changes in Mapper. Thanks for reviewing. |
I started this three months ago for time-series, but didn't spend much time on it since the difference was smaller. Now, the impact is more noticeable for low-latency queries and queries with large final results. |
Follow-up to #131485. Interestingly, the physical plan optimizer tests seem to pick up more cases when we map to a SINGLE agg directly.
A STATS command following a reduce operation, such as another Stats, Limit, TopN, Inline Stats, or Fork, is executed on the coordinator and can be performed in a single phase. This is useful for time-series queries where we execute two aggregations: first per time-series, then across time-series.